A Quantification of Metagame Shifts in Professional League of Legends Gameplay¶

Name(s): Jason Tran

Website Link: https://dsc80.enscribe.dev, https://jktrns.github.io/league-metagame-analysis

In [1]:
import logging
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List, Tuple

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from scipy.stats import permutation_test
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, LabelEncoder

from dsc80_utils import *

pd.options.plotting.backend = "plotly"

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

Step 1: Introduction¶

The data we are working with comes from Oracle's Elixir, a historical database and analytics provider for the League of Legends esports scene. The site is used by professional analysts and community enthusiasts alike, and provides comprehensive match data across nearly all major leagues and competitions internationally.

The dataset from Oracle's Elixir consists of .csv files, one per year of match data. The data covers matches from 2014 to the present day (the 2024 file is updated incrementally each day). Loading in a single game from the 2024 dataset:

In [2]:
display_df(
    pd.read_csv(
        "data/2024_LoL_esports_match_data_from_OraclesElixir.csv",
        low_memory=False,
    ).head(12)[
        [
            "gameid",
            "date",
            "side",
            "position",
            "teamname",
            "playername",
            "champion",
            "kills",
            "deaths",
            "assists",
        ]
    ],
    rows=12,
    cols=10,
)
gameid date side position teamname playername champion kills deaths assists
0 10660-10660_game_1 2024-01-01 05:13:15 Blue top LNG Esports Zika Aatrox 1 3 1
1 10660-10660_game_1 2024-01-01 05:13:15 Blue jng LNG Esports Weiwei Maokai 0 4 3
2 10660-10660_game_1 2024-01-01 05:13:15 Blue mid LNG Esports Scout Orianna 0 2 0
3 10660-10660_game_1 2024-01-01 05:13:15 Blue bot LNG Esports GALA Kalista 2 4 0
4 10660-10660_game_1 2024-01-01 05:13:15 Blue sup LNG Esports Mark Senna 0 3 3
5 10660-10660_game_1 2024-01-01 05:13:15 Red top Rare Atom Xiaoxu Rumble 4 0 6
6 10660-10660_game_1 2024-01-01 05:13:15 Red jng Rare Atom naiyou Rell 1 0 12
7 10660-10660_game_1 2024-01-01 05:13:15 Red mid Rare Atom VicLa LeBlanc 4 0 7
8 10660-10660_game_1 2024-01-01 05:13:15 Red bot Rare Atom Assum Varus 7 1 5
9 10660-10660_game_1 2024-01-01 05:13:15 Red sup Rare Atom Zorah Renata Glasc 0 2 13
10 10660-10660_game_1 2024-01-01 05:13:15 Blue team LNG Esports NaN NaN 3 16 7
11 10660-10660_game_1 2024-01-01 05:13:15 Red team Rare Atom NaN NaN 16 3 43

The dataset contains 168 columns of game-level statistics, ranging from basic performance metrics (e.g. kills, deaths, and assists) to far more nuanced analytics not shown in the table above (e.g. damage share, vision control, creep score), making it invaluable for professional teams analyzing trends and making strategic decisions.

Notice that the dataset's structure captures both macro-level game dynamics and individual player performance metrics. Each row represents a player's performance in a single game, with 10 rows per match (one for each player), plus two additional rows for team-level statistics. Key features include temporal data (patch numbers, timestamps for various objectives across the map), economic indicators (gold differences, resource distribution), and performance metrics (damage output, vision score). The data itself is very fine-grained and will allow for sophisticated analysis.
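As a quick illustration of this structure, a toy frame shaped like the dataset (the gameid and all values below are hypothetical) confirms the 10-players-plus-2-teams layout per match:

```python
import pandas as pd

# Build a toy frame mimicking the dataset's per-match structure.
# "demo_game_1" and every value below are hypothetical, for illustration only.
rows = []
for side in ["Blue", "Red"]:
    for pos in ["top", "jng", "mid", "bot", "sup"]:
        rows.append({"gameid": "demo_game_1", "side": side, "position": pos})
    # Each side also contributes one aggregate row with position == "team".
    rows.append({"gameid": "demo_game_1", "side": side, "position": "team"})
toy = pd.DataFrame(rows)

print(toy.groupby("gameid").size())        # 12 rows per match
print((toy["position"] == "team").sum())   # 2 team-level rows per match
```

The `position == "team"` marker is what the splitting function below keys on to separate the two granularity levels.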

The patch number stands out as a key feature, and naturally leads to the question: how can we quantify how the metagame shifts across different patches? As context, the "metagame" (colloquially, the "meta") refers to the popular strategies and team compositions that provide advantages over other strategies and compositions at a particular point in time. Shifts in the meta are most often caused by changes in game balancing (e.g. changes to champion abilities, item builds, maps, or core game mechanics) introduced in a new patch. Since the dataset also contains regional data, we can ask how the meta differs between regions, or how one region's meta fares against another's. The meta is also more pronounced in professional play, where most players have reached the theoretical skill ceiling and are mainly separated by strategy and composition; in amateur or casual play, the meta matters less than technical skill and "game sense"/knowledge.

Step 2: Data Cleaning and Exploratory Data Analysis¶

Data Cleaning¶

Since our dataset provides two layers of granularity (the player level and the team level), we can divide the dataset into two separate DataFrames: players and teams:

In [3]:
def load_match_data(years_range: range = range(2014, 2025)) -> pd.DataFrame:
    dfs = []
    for year in years_range:
        try:
            df = pd.read_csv(
                f"data/{year}_LoL_esports_match_data_from_OraclesElixir.csv",
                low_memory=False,
            )
            dfs.append(df)
        except FileNotFoundError:
            continue
    return pd.concat(dfs, ignore_index=True)


def split_player_team_data(
    matches_raw: pd.DataFrame,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    players_raw = matches_raw[
        matches_raw["position"].isin(["top", "jng", "mid", "bot", "sup"])
    ]
    teams_raw = matches_raw[matches_raw["position"] == "team"]
    return players_raw, teams_raw
In [4]:
matches_raw = load_match_data()
players_raw, teams_raw = split_player_team_data(matches_raw)

print("Shape of players DataFrame:", players_raw.shape)
print("Shape of teams DataFrame:", teams_raw.shape)

display_df(
    players_raw.iloc[:10][
        [
            "gameid",
            "date",
            "side",
            "teamname",
            "playername",
            "champion",
        ]
    ],
    rows=10,
    cols=7,
)

display_df(
    teams_raw.iloc[:2][
        [
            "gameid",
            "date",
            "teamname",
        ]
    ],
    rows=2,
    cols=4,
)
Shape of players DataFrame: (828380, 161)
Shape of teams DataFrame: (165676, 161)
gameid date side teamname playername champion
0 TRLH3/33 2014-01-14 17:52:02 Blue Fnatic sOAZ Trundle
1 TRLH3/33 2014-01-14 17:52:02 Blue Fnatic Cyanide Vi
2 TRLH3/33 2014-01-14 17:52:02 Blue Fnatic xPeke Orianna
3 TRLH3/33 2014-01-14 17:52:02 Blue Fnatic Rekkles Jinx
4 TRLH3/33 2014-01-14 17:52:02 Blue Fnatic YellOwStaR Annie
5 TRLH3/33 2014-01-14 17:52:02 Red Gambit Gaming Darien Dr. Mundo
6 TRLH3/33 2014-01-14 17:52:02 Red Gambit Gaming Diamondprox Shyvana
7 TRLH3/33 2014-01-14 17:52:02 Red Gambit Gaming Alex Ich LeBlanc
8 TRLH3/33 2014-01-14 17:52:02 Red Gambit Gaming Genja Lucian
9 TRLH3/33 2014-01-14 17:52:02 Red Gambit Gaming Edward Thresh
gameid date teamname
10 TRLH3/33 2014-01-14 17:52:02 Fnatic
11 TRLH3/33 2014-01-14 17:52:02 Gambit Gaming

After splitting the data into players and teams DataFrames, we notice that some columns contain only missing values. This is because certain columns only pertain to one granularity level but not the other. For example, player-specific columns like 'champion', 'position', and 'playername' will be empty in the teams DataFrame since they only apply to individual players, and team-level statistics like 'firstdragon' and 'firstblood' will be empty in the players DataFrame since they represent aggregate team performance rather than individual stats. We should clean these columns to avoid confusion when performing analysis:

In [5]:
def clean_empty_columns(df: pd.DataFrame) -> pd.DataFrame:
    return df.loc[:, ~df.isna().all()]
In [6]:
print("Columns that would be removed from players DataFrame:")
print(list(players_raw.columns[players_raw.isna().all()]))

print("Columns that would be removed from teams DataFrame:")
print(list(teams_raw.columns[teams_raw.isna().all()]))

print(
    f"Change in players columns after cleaning: {len(players_raw.columns)} -> {len(clean_empty_columns(players_raw).columns)}"
)
print(
    f"Change in teams columns after cleaning: {len(teams_raw.columns)} -> {len(clean_empty_columns(teams_raw).columns)}"
)
Columns that would be removed from players DataFrame:
['pick1', 'pick2', 'pick3', 'pick4', 'pick5', 'firstdragon', 'dragons', 'opp_dragons', 'elementaldrakes', 'opp_elementaldrakes', 'infernals', 'mountains', 'clouds', 'oceans', 'chemtechs', 'hextechs', 'dragons (type unknown)', 'elders', 'opp_elders', 'firstherald', 'heralds', 'opp_heralds', 'void_grubs', 'opp_void_grubs', 'firstbaron', 'firsttower', 'towers', 'opp_towers', 'firstmidtower', 'firsttothreetowers', 'turretplates', 'opp_turretplates', 'gspd', 'gpr']
Columns that would be removed from teams DataFrame:
['playername', 'playerid', 'champion', 'firstbloodkill', 'firstbloodassist', 'firstbloodvictim', 'damageshare', 'earnedgoldshare']
Change in players columns after cleaning: 161 -> 127
Change in teams columns after cleaning: 161 -> 153

Additionally, there are multiple columns in both datasets that are semantically boolean but are stored as either integers or floats. We should convert these columns to boolean type:

In [7]:
def get_boolean_columns(df: pd.DataFrame) -> list:
    bool_cols = []
    for col in df.columns:
        unique_vals = df[col].dropna().unique()
        # Skip all-missing columns, which would vacuously pass the check below.
        if len(unique_vals) > 0 and all(val in [0, 1] for val in unique_vals):
            bool_cols.append(col)
    return bool_cols


def convert_boolean_columns(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()  # avoid mutating (or warning on) the caller's sliced frame
    for col in get_boolean_columns(df):
        df[col] = df[col].astype("boolean")
    return df
In [8]:
print("Columns that would be converted to boolean in players DataFrame:")
print(players_raw.pipe(clean_empty_columns).pipe(get_boolean_columns))

print("Columns that would be converted to boolean in teams DataFrame:")
print(teams_raw.pipe(clean_empty_columns).pipe(get_boolean_columns))
Columns that would be converted to boolean in players DataFrame:
['playoffs', 'result', 'firstblood', 'firstbloodkill', 'firstbloodassist', 'firstbloodvictim']
Columns that would be converted to boolean in teams DataFrame:
['playoffs', 'result', 'firstblood', 'firstdragon', 'firstherald', 'firstbaron', 'firsttower', 'firstmidtower', 'firsttothreetowers']

We can do some other miscellaneous cleaning, such as converting the date column to a datetime object and zero-padding the patch column so patches sort correctly as strings. We can also add a major_patch column (the portion of the patch number to the left of the decimal point), which marks a major separation in the game's mechanics:

In [9]:
def clean_patch_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    df["date"] = pd.to_datetime(df["date"])

    def pad_patch(x):
        # Zero-pad both components so patches sort lexicographically
        # (e.g. 9.1 -> "09.01", 14 -> "14.00").
        if pd.isna(x):
            return np.nan
        parts = str(x).split(".")
        if len(parts) > 1:
            return parts[0].zfill(2) + "." + parts[1].zfill(2)
        return parts[0].zfill(2) + ".00"

    df["patch"] = df["patch"].apply(pad_patch)

    df["major_patch"] = df["patch"].str.split(".").str[0].str.zfill(2) + ".X"
    df["major_patch"] = pd.Categorical(
        df["major_patch"],
        categories=[f"{str(i).zfill(2)}.X" for i in range(3, 15)],
        ordered=True,
    )

    return df
In [10]:
print("Major patch categories:")
print(players_raw.pipe(clean_patch_data)["major_patch"].unique())
Major patch categories:
['03.X', NaN, '04.X', '05.X', '06.X', ..., '10.X', '11.X', '12.X', '13.X', '14.X']
Length: 13
Categories (12, object): ['03.X' < '04.X' < '05.X' < '06.X' ... '11.X' < '12.X' < '13.X' < '14.X']

Taking a look at the unique values of the datacompleteness column, we can see that there are three categories: "complete", "partial", and "error". The "error" category is likely due to a data collection error:

In [11]:
print(
    teams_raw["datacompleteness"]
    .value_counts(normalize=True)
    .apply(lambda x: f"{x:.4f}")
)
datacompleteness
complete    0.8791
partial     0.1204
error       0.0005
Name: proportion, dtype: object

Since the "error" category only accounts for 0.05% of the data, we can safely drop it:

In [12]:
def drop_error_data(df: pd.DataFrame) -> pd.DataFrame:
    return df[df["datacompleteness"] != "error"]

We also notice that a lot of the values for columns pick1 through pick5 are missing:

In [13]:
print("Percentage of missing values in pick columns:")
for col, pct in (
    teams_raw.pipe(drop_error_data)[[f"pick{i}" for i in range(1, 6)]].isna().mean().mul(100).items()
):
    print(f"{col}: {pct:.2f}%")
Percentage of missing values in pick columns:
pick1: 24.26%
pick2: 24.26%
pick3: 24.26%
pick4: 24.26%
pick5: 24.26%

In professional play, there is a "draft phase" in which bans and picks occur in a fixed, alternating order between the two teams:

[Figure: the draft phase's alternating ban/pick order]

The issue is that this dataset covers games from before this system was implemented, so many rows have the pick1 through pick5 columns entirely missing even though the corresponding champion entries in the players DataFrame are not. We can't simply fall back on the champion column either, because its ordering is fixed by role rather than by pick order: (1) top laner, (2) jungler, (3) middle laner, (4) bottom laner, and (5) support. Pick order is extraordinarily important in professional play and indicative of a team's strategy and priorities (and as such, a key component of the metagame).

To impute these missing values, we can calculate each champion's "presence" over a particular timeframe surrounding the match (here, the patch). Presence is the percentage of a patch's games in which the champion appeared, i.e. was picked or banned, and we use it to estimate pick priority. Although imperfect, this should be a good enough approximation for our purposes.
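To make the presence formula concrete, here is a toy example on a single hypothetical patch with two games; the champion names are real, but the picks and bans are fabricated purely for illustration (only one ban column is shown for brevity):

```python
import numpy as np
import pandas as pd

# Two games in one hypothetical patch; each game contributes two team rows,
# so four ban rows total. These picks/bans are fabricated for illustration.
team_bans = pd.DataFrame({"ban1": ["LeBlanc", "Jinx", "LeBlanc", "Ashe"]})
picks = pd.Series(["Orianna", "Orianna", "Jinx"])  # player picks in those games

total_games = len(team_bans) / 2  # two team rows per game
all_champs = np.concatenate([picks.values, team_bans.values.ravel()])
presence = pd.Series(all_champs).value_counts() / total_games

print(presence)
# LeBlanc: banned in both games        -> presence 1.0
# Jinx: banned once, picked once       -> presence 1.0
# Ashe: banned in one of the two games -> presence 0.5
```

Since a champion can appear at most once per game (either picked or banned), presence is bounded by 1.0 per patch.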

First, we forward-fill the patch column based on the chronological order of matches, then calculate the champion presence rate for each patch:

In [14]:
def fill_missing_patches(df: pd.DataFrame) -> pd.DataFrame:
    return (
        df.copy()
        .sort_values("date")
        .assign(patch=lambda x: x["patch"].ffill())
    )

def calculate_patch_importance(teams_df: pd.DataFrame, players_df: pd.DataFrame) -> dict:
    players_filtered = players_df[["patch", "champion"]]
    patch_importance = {}
    
    for patch in teams_df["patch"].unique():
        patch_teams = teams_df.loc[
            teams_df["patch"] == patch,
            ["ban1", "ban2", "ban3", "ban4", "ban5"]
        ]
        
        patch_picks = players_filtered.loc[
            players_filtered["patch"] == patch,
            "champion"
        ]
        
        bans = patch_teams.values.ravel()
        valid_bans = bans[~pd.isna(bans)]
        
        total_games = len(patch_teams) / 2
        all_champs = np.concatenate([patch_picks.values, valid_bans])
        champion_counts = pd.Series(all_champs).value_counts()
        patch_importance[patch] = champion_counts / total_games
    
    return patch_importance

For example, here are the presence rates for the top 10 champions in patch 14.22:

In [15]:
display_df(
    pd.DataFrame(
        calculate_patch_importance(
            teams_raw.pipe(clean_patch_data).pipe(drop_error_data).pipe(fill_missing_patches),
            players_raw.pipe(clean_patch_data).pipe(drop_error_data).pipe(fill_missing_patches),
        )["14.22"]
        .sort_values(ascending=False)
        .head(10)
    )
    .reset_index()
    .rename(columns={"index": "champion", "count": "presence"})
    .set_index("champion"),
    rows=10,
)
presence
champion
Corki 1.00
Aurora 1.00
Skarner 1.00
Ashe 0.94
K'Sante 0.88
Yone 0.71
Vi 0.65
Orianna 0.65
Varus 0.59
Jax 0.59

For each row, we create:

  • A picks column that uses the pick1 through pick5 columns if they are not missing—otherwise, we impute it with the champion column entries from the players DataFrame, sorting them by the champion presence rates for that patch as an estimate of their pick priority.
  • A bans column that is simply the concatenation of whatever is in the ban1 through ban5 columns (it's already ordered so we don't need to sort):
In [16]:
def process_draft_data(
    teams_df: pd.DataFrame, players_df: pd.DataFrame
) -> pd.DataFrame:
    draft_df = teams_df.copy()

    pick_cols = [f"pick{i}" for i in range(1, 6)]
    ban_cols = [f"ban{i}" for i in range(1, 6)]

    team_picks = draft_df[pick_cols].values
    ordered_picks = [[pick for pick in row if pd.notna(pick)] for row in team_picks]
    draft_df["ordered_picks"] = ordered_picks

    player_picks = (
        players_df.sort_values(["gameid", "side", "position"])
        .groupby(["gameid", "side"], observed=True)
        .agg(player_picks=("champion", list))
        .reset_index()
    )

    draft_df = pd.merge(
        draft_df, player_picks, on=["gameid", "side"], how="left", validate="1:1"
    )

    presence_rates = calculate_patch_importance(teams_df, players_df)
    presence_lookup = {
        (patch, champ): rate
        for patch, champs in presence_rates.items()
        for champ, rate in champs.items()
    }

    def order_picks_by_presence(row):
        if len(row.ordered_picks) == 5:
            return row.ordered_picks
        return sorted(
            row.player_picks,
            key=lambda champ: presence_lookup.get((row["patch"], champ), 0),
            reverse=True,
        )

    draft_df["picks"] = draft_df.apply(order_picks_by_presence, axis=1)

    ban_data = draft_df[ban_cols].values
    bans = [[ban for ban in row if pd.notna(ban)] for row in ban_data]
    draft_df["bans"] = bans

    draft_df = draft_df.drop(
        pick_cols + ban_cols + ["ordered_picks", "player_picks"], axis=1
    )

    return draft_df
In [17]:
players = (
    players_raw
    .pipe(drop_error_data)
    .pipe(clean_empty_columns)
    .pipe(convert_boolean_columns)
    .pipe(clean_patch_data)
    .pipe(fill_missing_patches)
)

teams = (
    teams_raw
    .pipe(drop_error_data)
    .pipe(clean_empty_columns)
    .pipe(convert_boolean_columns)
    .pipe(clean_patch_data)
    .pipe(fill_missing_patches)
    .pipe(process_draft_data, players)
)

teams
Out[17]:
gameid datacompleteness url league ... opp_deathsat25 major_patch picks bans
0 TRLH3/33 complete http://matchhistory.na.leagueoflegends.com/en/... EU LCS ... 10.0 03.X [Annie, Vi, Jinx, Trundle, Orianna] [Riven, Kha'Zix, Yasuo]
1 TRLH3/33 complete http://matchhistory.na.leagueoflegends.com/en/... EU LCS ... 4.0 03.X [Thresh, LeBlanc, Lucian, Shyvana, Dr. Mundo] [Kassadin, Nidalee, Elise]
2 TRLH3/44 complete http://matchhistory.na.leagueoflegends.com/en/... EU LCS ... 7.0 03.X [Elise, Lucian, Lulu, Shyvana, Kayle] [Lee Sin, Annie, Yasuo]
... ... ... ... ... ... ... ... ... ...
165589 LOLTMNT01_180445 complete NaN NEXO ... 14.0 14.X [Wukong, Caitlyn, Nautilus, K'Sante, Taliyah] [Skarner, Corki, Ashe, Gnar, Ornn]
165590 LOLTMNT01_181121 complete NaN NEXO ... 11.0 14.X [Yone, Poppy, Rumble, Jinx, Vi] [Zyra, Nocturne, Aurora, Caitlyn, Kai'Sa]
165591 LOLTMNT01_181121 complete NaN NEXO ... 15.0 14.X [LeBlanc, Rell, K'Sante, Aphelios, Taric] [Skarner, Corki, Ashe, Rek'Sai, Sejuani]

165592 rows × 146 columns

Finally, to meet the "tidy data" requirement of one observation per row, we can represent each match with a single row instead of two by prefixing team-dependent columns with the team's side (e.g. blue_firstblood or red_teamname):

In [18]:
def format_matches_data(draft_data: pd.DataFrame) -> pd.DataFrame:
    base_cols = [
        "gameid",
        "datacompleteness",
        "url",
        "league",
        "year",
        "split",
        "playoffs",
        "date",
        "game",
        "patch",
        "major_patch",
        "gamelength",
    ]
    draft_cols = [col for col in draft_data.columns if col not in base_cols]

    blue_cols = {col: f"blue_{col}" for col in draft_cols}
    red_cols = {col: f"red_{col}" for col in draft_cols}

    blue_teams = draft_data[draft_data["side"] == "Blue"].rename(columns=blue_cols)
    red_teams = draft_data[draft_data["side"] == "Red"].rename(columns=red_cols)

    matches = (
        draft_data[base_cols]
        .drop_duplicates()
        .merge(blue_teams[["gameid"] + list(blue_cols.values())], on="gameid")
        .merge(red_teams[["gameid"] + list(red_cols.values())], on="gameid")
    )

    return matches
In [19]:
matches = format_matches_data(teams)
matches
Out[19]:
gameid datacompleteness url league ... red_opp_assistsat25 red_opp_deathsat25 red_picks red_bans
0 TRLH3/33 complete http://matchhistory.na.leagueoflegends.com/en/... EU LCS ... 23.0 4.0 [Thresh, LeBlanc, Lucian, Shyvana, Dr. Mundo] [Kassadin, Nidalee, Elise]
1 TRLH3/44 complete http://matchhistory.na.leagueoflegends.com/en/... EU LCS ... 16.0 6.0 [Thresh, Renekton, Caitlyn, Gragas, Vi] [Kassadin, Kha'Zix, Ziggs]
2 TRLH3/76 complete http://matchhistory.na.leagueoflegends.com/en/... EU LCS ... 4.0 4.0 [Renekton, Vi, Leona, Ziggs, Jinx] [Yasuo, Elise, LeBlanc]
... ... ... ... ... ... ... ... ... ...
82793 LOLTMNT01_180434 complete NaN NEXO ... 24.0 19.0 [Varus, Xin Zhao, Sylas, Gnar, Thresh] [Skarner, Corki, Seraphine, Gwen]
82794 LOLTMNT01_180445 complete NaN NEXO ... 23.0 14.0 [Wukong, Caitlyn, Nautilus, K'Sante, Taliyah] [Skarner, Corki, Ashe, Gnar, Ornn]
82795 LOLTMNT01_181121 complete NaN NEXO ... 19.0 15.0 [LeBlanc, Rell, K'Sante, Aphelios, Taric] [Skarner, Corki, Ashe, Rek'Sai, Sejuani]

82796 rows × 280 columns

We now have three DataFrames with three levels of granularity we can perform analysis on:

  • matches: one row per match, with team-specific columns
  • teams: one row per team per match, with team-specific columns
  • players: one row per player per match, with player-specific columns

Univariate Analysis¶

League of Legends has a roster of 169 champions as of November 2024 (168 of which are available in professional play), but only a subset is considered viable in professional play during any given meta. Although a bivariate analysis is more appropriate for exploring meta shifts over time, we can still get a sense of the most "reliable" champions by looking at the top 20 most picked champions across the entire dataset. I'll color the bars by role for additional context, though as a univariate view this plot alone can't support conclusions about trends over time:

In [20]:
role_colors = {
    "top": "#E57373",
    "jng": "#81C784",
    "mid": "#64B5F6",
    "bot": "#FFB74D",
    "sup": "#BA68C8",
}

role_names = {
    "top": "Top Lane",
    "jng": "Jungle", 
    "mid": "Mid Lane",
    "bot": "Bot Lane",
    "sup": "Support"
}

fig = px.bar(
    (
        players.groupby(["champion", "position"])["champion"]
        .count()
        .reset_index(name="picks")
        .sort_values("picks", ascending=False)
        .loc[
            lambda df: df["champion"].isin(
                df.groupby("champion")["picks"].sum().nlargest(20).index
            )
        ]
        .assign(
            position=lambda df: pd.Categorical(
                df["position"],
                categories=["top", "jng", "mid", "bot", "sup"],
                ordered=True,
            )
        )
        .sort_values("position")
    ),
    x="champion",
    y="picks", 
    color="position",
    title="Top 20 Most Picked Champions",
    labels={"champion": "Champion", "picks": "Number of Picks", "position": "Role"},
    color_discrete_map=role_colors,
    category_orders={"position": ["top", "jng", "mid", "bot", "sup"]},
    template="plotly_dark",
    width=650,
    height=600
).update_layout(
    xaxis={
        "categoryorder": "array",
        "categoryarray": (
            players.groupby(["champion", "position"])["champion"]
            .count()
            .reset_index(name="picks")
            .groupby("champion")["picks"]
            .sum()
            .sort_values(ascending=False)
            .head(20)
            .index
        ),
        "title": {"font": {"family": "OpenSansRegular"}},
        "tickfont": {"family": "OpenSansRegular"},
        "tickangle": 45
    },
    yaxis={
        "title": {"font": {"family": "OpenSansRegular"}},
        "tickfont": {"family": "OpenSansRegular"},
        "range": [0, None]
    },
    paper_bgcolor='#0A1428',
    plot_bgcolor='#0A1428',
    title={
        "font": {"family": "Beaufort"},
        "y": 0.95
    },
    font=dict(
        family="OpenSansRegular",
        color='#f0e6d2'
    ),
    margin=dict(b=100),
    showlegend=True,
    legend_title_text="Typical Role",
    legend=dict(
        yanchor="top",
        y=0.99,
        xanchor="right", 
        x=0.99,
        bgcolor='rgba(10,20,40,0.8)',
        bordercolor='#f0e6d2'
    )
)

for i, role in enumerate(role_names.keys()):
    fig.data[i].name = role_names[role]

fig.show()
In [21]:
fig.write_html('charts/top-20-champions.html', include_plotlyjs='cdn')

The data reveals Nautilus as overwhelmingly the most picked champion of all time in professional play, followed by Ezreal and Braum. This immediately shows how specific champions are prioritized over others for strategic reasons. In particular:

  • Nautilus' kit provides strong engage tools and utility, letting him set up team fights and lock down targets of opportunity, which is valuable in nearly every meta.
  • Ezreal has a high skill ceiling, which professional players are best positioned to exploit.

We can also look at the most banned champions across the entire dataset. Banning typically serves one of two purposes: removing a champion that synergizes well with the enemy team's playstyle or composition (or the signature pick of the enemy's star player/carry), or removing a champion that is simply overpowered or frustrating in the current meta:

In [22]:
ban_data = pd.concat([
    pd.Series([ban for bans in matches["blue_bans"] for ban in bans]),
    pd.Series([ban for bans in matches["red_bans"] for ban in bans]),
])

role_data = players.groupby(['champion', 'position']).size().reset_index(name='count')
most_common_roles = role_data.sort_values('count', ascending=False).groupby('champion').first()

top_20_bans = ban_data.value_counts().head(20)

colors = [role_colors.get(most_common_roles.loc[champ, 'position'], '#C8AA6E') 
          if champ in most_common_roles.index else '#C8AA6E' 
          for champ in top_20_bans.index]

fig = px.bar(
    x=top_20_bans.index,
    y=top_20_bans.values,
    title="Top 20 Most Banned Champions",
    labels={"x": "Champion", "y": "Number of Bans"},
    template="plotly_dark",
    width=650,
    height=600,
).update_traces(marker_color=colors).update_layout(
    xaxis={
        "title": {"font": {"family": "OpenSansRegular"}},
        "tickfont": {"family": "OpenSansRegular"},
        "tickangle": 45
    },
    yaxis={
        "title": {"font": {"family": "OpenSansRegular"}},
        "tickfont": {"family": "OpenSansRegular"},
        "range": [0, None]
    },
    paper_bgcolor='#0A1428',
    plot_bgcolor='#0A1428',
    title={
        "font": {"family": "Beaufort"},
        "y": 0.95
    },
    font=dict(
        family="OpenSansRegular",
        color='#f0e6d2'
    ),
    margin=dict(b=100),
    showlegend=True,
    legend_title_text="Typical Role",
    legend=dict(
        yanchor="top",
        y=0.99,
        xanchor="right",
        x=0.99,
        bgcolor='rgba(10,20,40,0.8)',
        bordercolor='#f0e6d2'
    )
)

for role, color in role_colors.items():
    fig.add_trace(go.Bar(
        x=[None], 
        y=[None],
        name=role_names[role],
        marker_color=color,
        showlegend=True
    ))

fig.show()
In [23]:
fig.write_html('charts/top-20-bans.html', include_plotlyjs='cdn')

LeBlanc is, by a wide margin, the most banned champion of all time. Although these results may not reflect the current meta, it seems that at some point (or consistently), LeBlanc was a non-negotiable ban in professional play.

Bivariate Analysis¶

Moving onto bivariate analysis, we can take a look at the distribution of gold earned by each role across the dataset. In League of Legends, there are five primary roles that players assume, each with distinct responsibilities and strategic importance:

Role Description
Top The top laner is positioned in the top lane of the map. This player typically uses "tank" champions (characters that can absorb a lot of damage) or "bruisers" (characters that deal and withstand damage), who can initiate fights. They often play champions that excel in split-pushing (applying pressure on the map by attacking enemy structures while the rest of the team is elsewhere).
Jungle The jungler does not stay in a fixed lane but instead moves around the map, killing neutral monsters for gold and experience. This role is vital for map control, securing objectives like Dragon and Baron (powerful neutral monsters that provide team-wide benefits when defeated), and assisting other lanes by "ganking" (surprising enemy players in their lanes to help secure kills in outnumbered fights).
Mid The mid laner occupies the central lane and is crucial for controlling the map's center. This role usually involves playing "mages" (characters that use magic to deal damage) or "assassins" (characters that can quickly eliminate opponents), champions that deal significant damage and can roam to other lanes to assist teammates in securing kills.
Bot The bottom lane consists of two players: the ADC (Attack Damage Carry, responsible for dealing consistent physical damage, especially in the late game) and the Support (provides utility, vision, and protection for the ADC). The bot lane is a key area for team coordination and strategy.
Support The Support is part of the bottom lane duo and focuses on protecting the ADC. This role involves providing vision control with wards (items that reveal areas of the map), engaging or disengaging fights, and playing champions with crowd control abilities (skills that impair enemy movement or actions) and healing or shielding capabilities. The Support is essential for team fights and overall map awareness.

We can start with gold. Gold distribution is a key indicator of team resource-allocation strategy: different roles have different gold requirements based on their function within the composition, and understanding these patterns helps reveal how teams prioritize their resources:

In [24]:
position_labels = {pos: role_names[pos] for pos in players['position'].unique()}

fig = px.box(
    players,
    x="position", 
    y="totalgold",
    title="Gold Distribution by Role",
    labels={"position": "Role", "totalgold": "Total Gold"},
    color="position",
    color_discrete_map=role_colors,
    category_orders={"position": list(position_labels.keys())},
    template="plotly_dark",
    width=650,
    height=600
).update_layout(
    paper_bgcolor='#0A1428',
    plot_bgcolor='#0A1428',
    title={
        "font": {"family": "Beaufort"},
        "y": 0.95
    },
    font=dict(
        family="OpenSansRegular",
        color='#f0e6d2'
    ),
    xaxis={
        "title": {"font": {"family": "OpenSansRegular"}},
        "tickfont": {"family": "OpenSansRegular"},
        "ticktext": list(position_labels.values()),
        "tickvals": list(position_labels.keys())
    },
    yaxis={
        "title": {"font": {"family": "OpenSansRegular"}},
        "tickfont": {"family": "OpenSansRegular"}
    },
    showlegend=False
)

fig.show()
In [25]:
fig.write_html('charts/gold-distribution-by-role.html', include_plotlyjs='cdn')

The boxplot clearly demonstrates the hierarchical nature of gold distribution. Bot lane (ADC) players show both the highest median gold and the largest variance, reflecting their role as primary damage dealers who require costly item builds. Mid and top laners show similar distributions, indicating comparable farm priority, while junglers trail slightly behind due to their reliance on "jungle camps" (neutral monsters in the jungle) rather than the more plentiful minion waves that spawn in the lanes. Support players, as expected, show significantly lower gold totals, as they engage less in direct combat and farm far less gold overall.

We can also analyze "vision score" across roles, which is a measure of a player's contribution to vision control on the map. League of Legends has a "fog of war" mechanic that obscures the vast majority of the map from view—players need to either be in a position to see the enemy, or have vision-granting items/abilities (such as wards) to see the enemy and gather information:

In [26]:
fig = px.box(
    players,
    x="position", 
    y="visionscore",
    title="Vision Score Distribution by Role",
    labels={"position": "Role", "visionscore": "Vision Score"},
    color="position",
    color_discrete_map=role_colors,
    category_orders={"position": list(position_labels.keys())},
    template="plotly_dark",
    width=650,
    height=600
).update_layout(
    paper_bgcolor='#0A1428',
    plot_bgcolor='#0A1428',
    title={
        "font": {"family": "Beaufort"},
        "y": 0.95
    },
    font=dict(
        family="OpenSansRegular",
        color='#f0e6d2'
    ),
    xaxis={
        "title": {"font": {"family": "OpenSansRegular"}},
        "tickfont": {"family": "OpenSansRegular"},
        "ticktext": list(position_labels.values()),
        "tickvals": list(position_labels.keys())
    },
    yaxis={
        "title": {"font": {"family": "OpenSansRegular"}},
        "tickfont": {"family": "OpenSansRegular"}
    },
    showlegend=False
)

fig.show()
In [27]:
fig.write_html('charts/vision-score-distribution-by-role.html', include_plotlyjs='cdn')

Support players show dramatically higher vision scores with a wide distribution, reflecting their primary responsibility for map vision control. Junglers maintain the second-highest vision scores, which they use to secure objectives and track enemy movements. The relatively lower and similar vision scores among laners (top, mid, bot) suggest they focus more on farm and combat, relying on supports and junglers for primary vision control. This reveals a very concrete hierarchy between roles that professional play adheres to.

We can move on to patch distribution vs. game duration. Since patch numbers are categorical yet ordinal, we can convert the patch column into a new column major_patch that groups patches like 13.1, 13.2, 13.3, etc. together into a "13.X" format:
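A minimal sketch of that conversion, using a standalone Series with illustrative patch strings rather than the full players frame:

```python
import pandas as pd

# Illustrative patch strings; the real values come from players["patch"]
patches = pd.Series(["13.1", "13.2", "13.24", "14.1"], name="patch")

# Keep the major version and collapse the minor version into "X"
major_patch = patches.str.split(".").str[0] + ".X"
print(major_patch.tolist())  # ['13.X', '13.X', '13.X', '14.X']
```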

In [28]:
fig = px.bar(
    (
        players.groupby("major_patch", observed=True)["gamelength"]
        .mean()
        .apply(lambda x: x / 60)
        .reset_index()
    ),
    x="major_patch", 
    y="gamelength",
    title="Average Game Duration by Major Patch Version",
    labels={
        "major_patch": "Major Patch",
        "gamelength": "Average Game Duration (minutes)",
    },
    template="plotly_dark",
    width=650,
    height=600,
    color_discrete_sequence=['#C8AA6E']
).update_layout(
    paper_bgcolor='#0A1428',
    plot_bgcolor='#0A1428',
    title={
        "font": {"family": "Beaufort"},
        "y": 0.95
    },
    font=dict(
        family="OpenSansRegular",
        color='#f0e6d2'
    ),
    xaxis={
        "title": {"font": {"family": "OpenSansRegular"}},
        "tickfont": {"family": "OpenSansRegular"}
    },
    yaxis={
        "title": {"font": {"family": "OpenSansRegular"}},
        "tickfont": {"family": "OpenSansRegular"}
    }
)

fig.show()
In [29]:
fig.write_html('charts/average-game-length-by-patch.html', include_plotlyjs='cdn')

The trend of decreasing game duration across patches likely reflects Riot Games' intentional game design changes. Through various patch updates, they have introduced mechanics that accelerate the pace of matches, such as increased gold generation, stronger objective rewards (like dragons and heralds), and changes to tower durability. These changes were probably made to speed up the game to make it both more engaging and less passive. In general, team damage has also gone up over time from a combination of both champion changes and item build changes.

Interesting Aggregates¶

Here is one of the most interesting phenomena we can observe from the data: the win rate of the "blue side" (the team starting on the bottom-left half of the map) is significantly higher than that of the "red side" (the team starting on the top-right):

In [30]:
teams.groupby("side").agg(
    {
        "result": "mean",
        "firstblood": "mean",
        "firstdragon": "mean",
        "firstherald": "mean",
        "firsttower": "mean",
        "firstbaron": "mean",
    }
)
Out[30]:
result firstblood firstdragon firstherald firsttower firstbaron
side
Blue 0.53 0.51 0.44 0.56 0.55 0.49
Red 0.47 0.49 0.56 0.39 0.45 0.46

League of Legends Map Diagram

There are a couple reasons why it might be the case that the blue team has a higher win rate:

  • In the draft phase: if the state of the meta involves an overpowered champion with no true counter, the red team is forced to either ban the champion or allow the blue team to pick it with no trouble. It's good to note that in balanced patches the red team actually has an advantage in the draft phase because they have the first counterpick.
  • In the map layout: as evident from the above aggregate statistics, the blue team is more likely to secure the Rift Herald/Baron Nashor objectives (located in the same pit) due to the placement of walls on the map. Although the red team is significantly more likely to secure the Dragon objective, the Dragon is objectively riskier to fight due to its difficult-to-dodge attacks. Three players on the bottom half of the map (the jungler, ADC, and Support) also risk instigating a large team fight if they attempt to fight the Dragon.
  • Ganking opportunities: the blue side has shrubbery in the jungle which allows for surprise attacks on the red team's top laner, while the red team has shrubbery that allows for surprise attacks on the blue team's bottom laners. Although this seems like an equal advantage, the issue is that the top lane only has a single player (which makes it difficult to escape a gank) while the bottom lane has two players that can help each other escape. As such, there are particular champions on the top lane that have significantly higher win rate disparities between the two sides.
  • Camera placement: the camera angle in League of Legends is not a perfect top-down view of the map, but rather angled such that the blue side is slightly closer to the viewer. This makes the blue team have significantly better visibility. This is the biggest disadvantage of the red side.

Step 3: Assessment of Missingness¶

For this analysis, we need to carefully consider how to select a column with non-trivial missingness. In this dataset, we observe that many columns tend to be missing together in groups (for example, all ban-related columns are typically missing simultaneously). Rather than arbitrarily selecting one of these co-missing columns, we can leverage the dataset's 'datacompleteness' indicator column, which marks whether each game's data is fully complete or partial. By analyzing how well this indicator aligns with actual missing data patterns, we can validate it as a reliable proxy for missingness. If validated, we can use this indicator column for our permutation tests instead of individual missing columns, giving us a more meaningful measurement of missingness.

In order to validate the 'datacompleteness' indicator as a proxy for missingness, we create a table that shows the percentage of data that is missing for each column, separated by whether it is marked as 'complete' or 'partial'. To quantify the disparity in missingness between complete and partial data, we engineer a new feature 'disparity_score' for each column using the formula:

$$ \texttt{disparity\_score} = \frac{\texttt{partial\_missing\_prop} - \texttt{complete\_missing\_prop}}{1 + \texttt{complete\_missing\_prop}} $$

This score helps us understand which columns are most affected by incomplete data collection, and is interpretable as such:

  • A score close to 1 indicates the column is usually present in complete data but missing in partial data
  • A score close to 0 indicates similar missingness between complete and partial data
  • A negative score would indicate more missingness in complete data than partial data (which would be unusual)
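As a quick sanity check of the formula at its two extremes (the helper function below is purely illustrative, not part of the analysis code):

```python
def disparity_score(partial_missing_prop, complete_missing_prop):
    # (partial - complete) / (1 + complete), per the formula above
    return (partial_missing_prop - complete_missing_prop) / (1 + complete_missing_prop)

# Never missing in complete data, almost always missing in partial data
print(round(disparity_score(0.99, 0.00), 2))  # 0.99
# The unusual case: far more missing in complete data than in partial data
print(round(disparity_score(0.01, 0.96), 2))  # -0.48
```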
In [31]:
missingness_by_completeness = (
    pd.DataFrame(
        {
            category: teams[teams["datacompleteness"] == category].isnull().mean()
            for category in teams["datacompleteness"].unique()
        }
    )
    .assign(
        disparity_score=lambda df: (df["partial"] - df["complete"])
        / (1 + df["complete"])
    )
    .sort_values("disparity_score", ascending=False)
)

missingness_by_completeness[["complete", "partial", "disparity_score"]].round(2)
Out[31]:
complete partial disparity_score
clouds 0.00 0.99 0.99
oceans 0.00 0.99 0.99
elders 0.00 0.99 0.99
... ... ... ...
monsterkillsownjungle 0.41 0.02 -0.28
url 0.39 0.00 -0.28
dragons (type unknown) 0.96 0.01 -0.48

146 rows × 3 columns

We arbitrarily bucket the disparity scores into five categories: "Very High", "High", "Moderate", "Low", and "Negative":

In [32]:
pd.DataFrame(
    missingness_by_completeness["disparity_score"]
    .apply(lambda x: "Very High" if x > 0.8 
           else "High" if x > 0.5 
           else "Moderate" if x > 0.2
           else "Low" if x >= 0
           else "Negative")
    .value_counts(),
    columns=["count"]
).reindex(["Very High", "High", "Moderate", "Low", "Negative"])
Out[32]:
count
disparity_score
Very High 79
High 3
Moderate 12
Low 43
Negative 9

As we can see, the vast majority of columns fall into the "Very High" category, with the "Low" bucket trailing behind and the "Moderate" and "High" categories low in frequency. Disturbingly, there are columns that fall into the "Negative" category, which would indicate that the column is more likely to be missing in complete data than partial data. Taking a look at these:

In [33]:
display_df(
    missingness_by_completeness[
        missingness_by_completeness["disparity_score"] < 0
    ][["complete", "partial", "disparity_score"]].round(2),
    rows=9,
)
complete partial disparity_score
teamname 0.00 0.00 -0.00
game 0.00 0.00 -0.00
teamid 0.01 0.01 -0.01
split 0.23 0.11 -0.10
total cs 0.96 0.65 -0.16
monsterkillsenemyjungle 0.41 0.02 -0.28
monsterkillsownjungle 0.41 0.02 -0.28
url 0.39 0.00 -0.28
dragons (type unknown) 0.96 0.01 -0.48

Looking through these columns, each is most likely a result of changes to either the game or the data collection process. For example, 'dragons (type unknown)' is likely an artifact of the data collection process not yet specifying which type of dragon was slain, so it was categorized as "unknown". In general, these columns are most likely statistics that were not tracked in older matches, which are marked as "complete" (since that was all that was available at the time) despite lacking newer metrics introduced to the data collection process later in the game's lifespan.

However, since the overwhelming majority of columns fall into the positive realm of disparity scores (meaning that the column is more likely to be missing in partial data), we can safely say that the 'datacompleteness' indicator is a reliable proxy for missingness.

Moving onto missingness dependency analysis itself, we can reason that whether the data is complete is dependent on the 'league' column—this is because different leagues most likely have different practices for storing and transmitting historical match data. Below, the DataFrame completeness_percentages shows leagues that have some incomplete data (we're filtering out leagues with 100% completeness):

In [34]:
completeness_percentages = pd.crosstab(
    teams["league"], teams["datacompleteness"], normalize="index"
)

display_df(
    completeness_percentages[completeness_percentages["complete"] < 1]
    .sort_values("partial", ascending=False)
    .apply(lambda x: x.map("{:.4f}".format)),
    rows=22,
)
datacompleteness complete partial
league
ASCI 0.0000 1.0000
LSPL 0.0000 1.0000
DCup 0.2649 0.7351
LDL 0.3707 0.6293
LPL 0.3825 0.6175
MSI 0.8730 0.1270
UPL 0.9058 0.0942
WLDs 0.9671 0.0329
LLA 0.9957 0.0043
LCS 0.9970 0.0030
TCL 0.9977 0.0023
LEC 0.9981 0.0019
LCSA 0.9981 0.0019
LFL2 0.9985 0.0015
TAL 0.9987 0.0013
LCK 0.9989 0.0011
OPL 0.9991 0.0009
LMS 0.9992 0.0008
CK 0.9993 0.0007
LJL 0.9994 0.0006

Informally, we can observe that some leagues have systematically more partial data than others (some even had "error" data that we removed). To formally test this, we can perform a permutation test to see if the missingness of data is dependent on the league. Establishing the null and alternative hypotheses, we have:

  • Null hypothesis: The missingness of data (completeness) is independent of the league.
  • Alternative hypothesis: The missingness of data (completeness) is dependent on the league.

For our test statistic, we can use the total variation distance (TVD) between complete and incomplete data. More intuitively, if we look at what percentage each league makes up of complete data vs. incomplete data, TVD measures how different these percentages are (e.g. if LCS makes up 30% of complete data but only 5% of incomplete data, meaning that the majority of its data is complete, that contributes to a larger TVD). The larger the TVD, the more likely we can reject the null hypothesis in favor of the alternative. We first define tvd_statistic as our test statistic function, and a helper function plot_tvd_distribution to visualize the null distribution against the observed test statistic:
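To make the LCS illustration above concrete (these proportions are made up, not values from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical share of complete vs. partial data contributed by each league
complete = pd.Series({"LCS": 0.30, "LPL": 0.50, "LCK": 0.20})
partial = pd.Series({"LCS": 0.05, "LPL": 0.75, "LCK": 0.20})

# TVD is half the sum of absolute differences between the two distributions
tvd = 0.5 * np.abs(complete - partial).sum()
print(round(tvd, 2))  # 0.25
```

Only LCS and LPL differ between the two distributions here, so the TVD of 0.25 is driven entirely by those two leagues.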

In [35]:
def tvd_statistic(x, y):
    x_counts = pd.Series(x).value_counts(normalize=True)
    y_counts = pd.Series(y).value_counts(normalize=True)

    all_categories = x_counts.index.union(y_counts.index)

    x_counts = x_counts.reindex(all_categories, fill_value=0)
    y_counts = y_counts.reindex(all_categories, fill_value=0)

    tvd = 0.5 * np.abs(x_counts - y_counts).sum()
    return tvd


def plot_tvd_distribution(result, title="Null Distribution of Test Statistic Under Null Hypothesis", x_label="Total Variation Distance", y_label="Frequency"):
    fig = px.histogram(
        result.null_distribution,
        nbins=30,
        title=title,
        template="plotly_dark",
        width=650,
        height=600,
        color_discrete_sequence=['#C8AA6E']
    ).update_layout(
        paper_bgcolor='#0A1428',
        plot_bgcolor='#0A1428',
        title={
            "font": {"family": "Beaufort"},
            "y": 0.95
        },
        font=dict(
            family="OpenSansRegular",
            color='#f0e6d2'
        ),
        xaxis={
            "title": {"text": x_label, "font": {"family": "OpenSansRegular"}},
            "tickfont": {"family": "OpenSansRegular"}
        },
        yaxis={
            "title": {"text": y_label, "font": {"family": "OpenSansRegular"}},
            "tickfont": {"family": "OpenSansRegular"}
        },
        showlegend=False
    )

    fig.add_vline(
        x=result.statistic,
        line_dash="dash", 
        line_color="#E57373",
        line_width=3,
        annotation_text="Observed statistic  ",  # helper is reused for non-TVD statistics
        annotation_position="top left"
    )

    fig.update_traces(marker_line_width=0)
    return fig
In [36]:
missingness_1 = permutation_test(
    data=(
        teams.loc[teams["datacompleteness"] == "complete", "league"].values,
        teams.loc[teams["datacompleteness"] != "complete", "league"].values,
    ),
    statistic=tvd_statistic,
    permutation_type="independent",
    vectorized=False,
    n_resamples=1000,
    alternative="greater",
    random_state=42,
)

print(f"Observed TVD: {missingness_1.statistic:.4f}")
print(f"P-value: {missingness_1.pvalue:.4f}")
Observed TVD: 0.9061
P-value: 0.0010
In [37]:
fig = plot_tvd_distribution(missingness_1, title="Null Distribution of TVD Under Null Hypothesis: League vs. Completeness", x_label="Total Variation Distance", y_label="Frequency")
fig.show()
fig.write_html('charts/missingness-1.html', include_plotlyjs='cdn')

Assuming a significance level of 0.05, given that $p = 0.0010$ we reject the null hypothesis, which suggests that the missingness of data is dependent on the league.

Moving on, we can perform a similar analysis on the 'side' column. We have the null and alternative hypotheses:

  • Null hypothesis: The missingness of data (completeness) is independent of the side.
  • Alternative hypothesis: The missingness of data (completeness) is dependent on the side.

For our test statistic, we can again use the TVD between complete and incomplete data:

In [38]:
missingness_2 = permutation_test(
    data=(
        teams.loc[teams["datacompleteness"] == "complete", "side"].values,
        teams.loc[teams["datacompleteness"] != "complete", "side"].values,
    ),
    statistic=tvd_statistic,
    permutation_type="independent",
    vectorized=False,
    n_resamples=1000,
    alternative="greater",
    random_state=42,
)

print(f"Observed TVD: {missingness_2.statistic:.4f}")
print(f"P-value: {missingness_2.pvalue:.4f}")
Observed TVD: 0.0000
P-value: 1.0000
In [39]:
fig = plot_tvd_distribution(missingness_2, title="Null Distribution of TVD Under Null Hypothesis: Side vs. Completeness", x_label="Total Variation Distance", y_label="Frequency")
fig.show()
fig.write_html('charts/missingness-2.html', include_plotlyjs='cdn')

We have an observed TVD of $0.00$ and a p-value of $1.00$, and as such we fail to reject the null hypothesis. This suggests that the missingness of data is independent of the side.

Step 4: Hypothesis Testing¶

In our previous analysis, we observed that the blue side has about a 3% higher win rate than the red side when looking at data across all time. Although we reasoned as to why that might be the case using domain knowledge, we want to determine if this observed difference is statistically significant. We establish the following hypotheses:

  • Null hypothesis: There is no systematic difference in win rates between the blue and red sides. Any observed differences in win rates can be attributed to random variation in the data.
  • Alternative hypothesis: The blue side has a systematically higher win rate than the red side, beyond what would be expected by random chance alone. This suggests that side selection provides a meaningful competitive advantage.

For our test statistic, we will be using the difference in means of win rates between the blue and red sides, as this directly measures the effect we are interested in—the impact of side selection on match outcomes.

In [40]:
def diff_in_means(x, y):
    return x.mean() - y.mean()


hypothesis_1 = permutation_test(
    data=(
        teams[teams["side"] == "Blue"]["result"],
        teams[teams["side"] == "Red"]["result"],
    ),
    statistic=diff_in_means,
    permutation_type="independent",
    vectorized=False,
    n_resamples=1000,
    alternative="greater",
    random_state=42,
)

print(f"Observed difference in win rates: {hypothesis_1.statistic:.4f}")
print(f"P-value: {hypothesis_1.pvalue:.4f}")
Observed difference in win rates: 0.0646
P-value: 0.0010
In [41]:
fig = plot_tvd_distribution(hypothesis_1, title="Null Distribution of Difference in Means Under Null Hypothesis: Blue vs. Red Side", x_label="Difference in Win Rates", y_label="Frequency")
fig.show()
fig.write_html('charts/hypothesis-1.html', include_plotlyjs='cdn')

Assuming a significance level of 0.05 and given that $p = 0.0010$, we reject the null hypothesis. The test suggests that side selection provides a meaningful competitive advantage for the blue side.

Step 5: Framing a Prediction Problem¶

In our earlier analysis, we examined how the metagame in professional League of Legends manifests in the data it generates. We observed trends in champion pick and ban rates, role-specific gold distributions, and how average game durations have evolved over time. We also observed that the blue side has a systematic advantage over the red side in terms of win rate, another byproduct of the metagame.

Building upon this understanding, we can frame a prediction problem: is it possible to predict the patch version of a professional League of Legends match based solely on the data from that particular match?

This prediction task is a multiclass classification problem, where the response variable is the patch version of the game (e.g. "13.1", "13.2", etc.). If this approach proves too fine-grained, we can instead group minor patch updates under a common umbrella (e.g. "13.X"). We chose this prediction problem because champion selections, bans, and other statistics are highly influenced by the current meta, which is in turn shaped by the balance changes introduced in each patch. Certain champions may become more viable due to buffs, nerfs, or reworks (or simply their introduction as an overtuned character), leading to shifts in their pick and ban rates. The presence and absence of particular champions can thus be used to predict the patch version of a match.

For evaluating the model, we will use:

  • Accuracy as our primary metric; it is more suitable here than, say, F1 score because neither type of error (type 1 or type 2) carries a greater cost than the other in this setting
  • Mean absolute error of our predictions (encoded as incremental integers) as a secondary metric to see, if we were wrong, how wrong we were
  • Accuracy of the major patch version (e.g. "13.X" instead of "13.14") as a tertiary metric to see how well we can predict extreme meta shifts rather than the nuanced ones across minor updates

Since we are only feeding the model features from the draft phase, we cannot use data from the game itself (e.g. kills obtained by a champion, who won the game, etc) to draw conclusions. However, we can still use overarching metadata like the league, team name, etc. if we so wish, since that data is available at the time the draft phase occurs.

Step 6: Baseline Model¶

For this model, we'll be using a RandomForestClassifier to make classifications, and a LabelEncoder to convert the patch version into a numerical format. For this baseline model, we'll encode the picks and bans of a match into a binary vector that indicates the presence or absence of a champion (1 if the champion is picked/banned by either team, 0 otherwise). To do so, we define transform_picks_bans to transform the DataFrame X into a binary matrix of shape (n_samples, n_champions), where each row corresponds to a match and each column corresponds to a champion. We will use this as part of a FunctionTransformer in our Pipeline to preprocess the data:

In [42]:
def transform_picks_bans(X):
    # Build the champion vocabulary from every pick and ban column.
    # Note: the vocabulary is rebuilt on every call, so the feature columns
    # only line up between fit and predict if the training and test sets
    # contain the same pool of champions.
    all_picks_bans = []
    for column in ["blue_picks", "red_picks", "blue_bans", "red_bans"]:
        all_picks_bans.extend(champ for picks in X[column] for champ in picks)
    unique_champions = sorted(set(all_picks_bans))
    champion_to_idx = {champ: idx for idx, champ in enumerate(unique_champions)}

    # Mark each champion that was picked or banned in the match with a 1
    features = np.zeros((len(X), len(unique_champions)))
    for i, (_, row) in enumerate(X.iterrows()):
        all_champs = (
            row["blue_picks"] + row["red_picks"] + row["blue_bans"] + row["red_bans"]
        )
        for champ in all_champs:
            features[i, champion_to_idx[champ]] = 1

    return pd.DataFrame(
        features, columns=[f"{champ}_presence" for champ in unique_champions]
    )

From here, we drop the rows with NaN patch values and define our feature matrix X and target vector y:

In [43]:
X = matches[["blue_picks", "red_picks", "blue_bans", "red_bans"]]
y = matches["patch"]

display_df(X)
display(y)
print(f"Number of unique patches: {len(y.unique())}")
blue_picks red_picks blue_bans red_bans
0 [Annie, Vi, Jinx, Trundle, Orianna] [Thresh, LeBlanc, Lucian, Shyvana, Dr. Mundo] [Riven, Kha'Zix, Yasuo] [Kassadin, Nidalee, Elise]
1 [Elise, Lucian, Lulu, Shyvana, Kayle] [Thresh, Renekton, Caitlyn, Gragas, Vi] [Lee Sin, Annie, Yasuo] [Kassadin, Kha'Zix, Ziggs]
2 [Thresh, Gragas, Lee Sin, Shyvana, Vayne] [Renekton, Vi, Leona, Ziggs, Jinx] [Kassadin, Annie, Orianna] [Yasuo, Elise, LeBlanc]
... ... ... ... ...
82793 [Ashe, Vi, Ahri, Renekton, Rell] [Varus, Xin Zhao, Sylas, Gnar, Thresh] [LeBlanc, Aurora, Yone, Renata Glasc, Bard] [Skarner, Corki, Seraphine, Gwen]
82794 [Nocturne, Rell, Hwei, Zac, Yone] [Wukong, Caitlyn, Nautilus, K'Sante, Taliyah] [Zyra, Aurora, Kai'Sa, Olaf, Syndra] [Skarner, Corki, Ashe, Gnar, Ornn]
82795 [Yone, Poppy, Rumble, Jinx, Vi] [LeBlanc, Rell, K'Sante, Aphelios, Taric] [Zyra, Nocturne, Aurora, Caitlyn, Kai'Sa] [Skarner, Corki, Ashe, Rek'Sai, Sejuani]

82796 rows × 4 columns

0        03.15
1        03.15
2        03.15
         ...  
82793    14.22
82794    14.22
82795    14.22
Name: patch, Length: 82796, dtype: object
Number of unique patches: 220

The shape of the transformed X should be (number of matches, number of champions):

In [44]:
print(f"Shape after transformation: {transform_picks_bans(X).shape}")
transform_picks_bans(X)
Shape after transformation: (82796, 168)
Out[44]:
Aatrox_presence Ahri_presence Akali_presence Akshan_presence ... Ziggs_presence Zilean_presence Zoe_presence Zyra_presence
0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ...
82793 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
82794 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
82795 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0

82796 rows × 168 columns

From here, we split the data into training and testing sets, and then throw everything into our Pipeline using make_pipeline. (Note: this step is supposed to be done in Part 7, but I'm doing it here since the hyperparameters are needed for the initial model regardless.) The hyperparameters were chosen after a manual tuning process, as a grid search (or even the experimental HalvingGridSearchCV) would be extremely costly (one iteration of fitting takes 1 minute to run). Explaining each parameter:

  • n_estimators=500: A higher number of trees generally leads to better performance by reducing variance, though with diminishing returns. I found that 500 trees provided a good balance between model performance and training time after a few trials.
  • max_depth=16: This was my primary focus when doing the manual tuning, as I wanted to find that threshold where the model would still be able to capture complex patterns in the champion combinations, but not so deep that it would overfit (which is surprisingly easy to do with this type of model).
  • max_features="sqrt": Using square root for splits helps reduce correlation between trees and prevents overfitting.
  • min_samples_split=16: Requiring at least 16 samples to split a node helps ensure splits are statistically meaningful and reduces overfitting. This was also chosen after a few trials, and I often changed it in conjunction with min_samples_leaf (typically increasing both to reduce overfitting).
  • min_samples_leaf=4: I found that too low of a value for this would cause the model to fit noise in the data, so I settled on 4 after a few trials.
  • class_weight="balanced": Since patches may have uneven numbers of games, balanced weights ensure the model learns equally from all patches.
  • random_state=42: Fixed seed for reproducibility.
  • n_jobs=6: I have a 6-core CPU, so I set this to 6 to speed up training.
  • verbose=1: This provides progress updates during training, which is helpful given that training takes a while.
In [45]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# This doesn't fit in the pipeline because it processes the target vectors
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)

initial_model = make_pipeline(
    FunctionTransformer(transform_picks_bans),
    RandomForestClassifier(
        n_estimators=500,
        max_depth=16,
        max_features="sqrt",
        min_samples_split=16,
        min_samples_leaf=4,
        class_weight="balanced",
        random_state=42,
        n_jobs=6,
        verbose=1,
    ),
)

Let's fit the model and evaluate its performance on both the training and testing sets:

In [46]:
initial_model.fit(X_train, y_train_encoded)

initial_train_pred = initial_model.predict(X_train)
initial_test_pred = initial_model.predict(X_test)

print(f"Training accuracy: {accuracy_score(y_train_encoded, initial_train_pred):.4f}")
print(f"Testing accuracy: {accuracy_score(y_test_encoded, initial_test_pred):.4f}")
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:    1.1s
[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed:    4.8s
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed:   10.8s
[Parallel(n_jobs=6)]: Done 500 out of 500 | elapsed:   12.1s finished
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:    1.1s
[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed:    5.8s
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed:   13.4s
[Parallel(n_jobs=6)]: Done 500 out of 500 | elapsed:   15.2s finished
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:    0.2s
[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed:    1.5s
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed:    3.5s
Training accuracy: 0.8243
Testing accuracy: 0.6801
[Parallel(n_jobs=6)]: Done 500 out of 500 | elapsed:    4.0s finished

Our model achieves a test set accuracy of 68.01%, which is not too shabby given that there are 220 unique patches to choose from. Note that we grant no partial credit for being "close" to the correct patch version: a wildly wrong guess is punished the same as a near miss (e.g. guessing 3.15 instead of 14.20 costs exactly as much as guessing 14.19 instead of 14.20). Although accuracy remains our primary evaluation metric, we can also look at the mean absolute error of our predictions to see how close they are to the correct patch version. We take the absolute difference between the encoded true patch versions and the predicted patch versions (which the LabelEncoder maps to incremental integers), then average that absolute difference across all predictions:
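On toy labels (illustrative patch strings, not actual model output), the encoded-label MAE described above works like this:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

true_patches = np.array(["13.1", "13.2", "13.3", "14.1"])
pred_patches = np.array(["13.1", "13.3", "13.2", "13.1"])

# LabelEncoder assigns incremental integers to the sorted patch strings
le = LabelEncoder()
le.fit(np.union1d(true_patches, pred_patches))
y_true = le.transform(true_patches)  # [0, 1, 2, 3]
y_pred = le.transform(pred_patches)  # [0, 2, 1, 0]

# Mean absolute difference of the encoded labels, in "patch steps"
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 1.25
```

Every step counts equally here, whether it crosses a minor or a major patch boundary, which is a known limitation of this encoding.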

In [47]:
def evaluate_predictions(y_true_encoded, y_pred_encoded, le):
    accuracy = accuracy_score(y_true_encoded, y_pred_encoded)
    mae = np.mean(np.abs(y_true_encoded - y_pred_encoded))

    abs_diffs = np.abs(y_true_encoded - y_pred_encoded)

    max_diff = 20
    within_n = []
    for n in range(max_diff + 1):
        within_n.append(np.mean(abs_diffs <= n))

    true_patches = le.inverse_transform(y_true_encoded)
    pred_patches = le.inverse_transform(y_pred_encoded)

    true_major = np.array([p.split(".")[0] for p in true_patches])
    pred_major = np.array([p.split(".")[0] for p in pred_patches])

    major_accuracy = accuracy_score(true_major, pred_major)

    print(f"Accuracy (exact match): {accuracy:.4f}")
    print(f"Major Patch Accuracy: {major_accuracy:.4f}")
    print(f"Mean Absolute Error (in patch steps): {mae:.4f}")

    return {
        "accuracy": accuracy,
        "major_accuracy": major_accuracy,
        "mae": mae,
        "within_n": within_n,
    }
In [48]:
print("\nTraining Metrics:")
initial_train_metrics = evaluate_predictions(y_train_encoded, initial_train_pred, le)
    
print("\nTest Metrics:")
initial_test_metrics = evaluate_predictions(y_test_encoded, initial_test_pred, le)
Training Metrics:
Accuracy (exact match): 0.8243
Major Patch Accuracy: 0.9810
Mean Absolute Error (in patch steps): 1.0207

Test Metrics:
Accuracy (exact match): 0.6801
Major Patch Accuracy: 0.9659
Mean Absolute Error (in patch steps): 1.8295

On average, our predictions are off by about 1.0207 patch steps on the training set and 1.8295 on the testing set (a slight drawback of this approach is that crossing a major version and a minor version both count as a single step). If we only require the major patch to be correct, we get 98.10% accuracy on the training set and 96.59% on the testing set. We can graph how often our predictions fall within a certain number of patch steps of the actual patch version:
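As a sanity check on this "patch steps" interpretation: LabelEncoder assigns integer labels in sorted order of the class strings, and the dataset's patch strings are zero-padded (e.g. 03.15), so lexicographic order coincides with chronological order. A minimal illustration (the patch subset here is arbitrary):

```python
from sklearn.preprocessing import LabelEncoder

# Zero-padded patch strings sort chronologically; unpadded "3.15" would
# sort after "14.20" and break the distance interpretation.
patches = ["14.20", "03.15", "10.14", "14.19"]
le_demo = LabelEncoder().fit(patches)
print(list(le_demo.classes_))  # ['03.15', '10.14', '14.19', '14.20']
```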

In [49]:
df = pd.DataFrame(
    {
        "Steps Away": range(len(initial_train_metrics["within_n"])),
        "Training Set": initial_train_metrics["within_n"],
        "Test Set": initial_test_metrics["within_n"],
    }
).melt(id_vars=["Steps Away"], var_name="Dataset", value_name="Accuracy")

fig = px.line(
    df,
    x="Steps Away",
    y="Accuracy", 
    color="Dataset",
    markers=True,
    color_discrete_map={"Training Set": "#C8AA6E", "Test Set": "#785A28"},
    template="plotly_dark",
    width=650,
    height=600
)

fig.update_layout(
    paper_bgcolor='#0A1428',
    plot_bgcolor='#0A1428',
    title={
        "text": "Model Accuracy Within N Patch Versions",
        "font": {"family": "Beaufort"},
        "y": 0.95
    },
    font=dict(
        family="OpenSansRegular",
        color='#f0e6d2'
    ),
    xaxis={
        "title": {
            "text": "Number of Patch Versions (Steps) Away",
            "font": {"family": "OpenSansRegular"}
        },
        "tickfont": {"family": "OpenSansRegular"},
        "tickmode": "linear"
    },
    yaxis={
        "title": {
            "text": "Cumulative Accuracy",
            "font": {"family": "OpenSansRegular"}
        },
        "tickfont": {"family": "OpenSansRegular"}
    },
    showlegend=True,
    legend={"x": 0.85, "y": 0.15},
    hovermode="x unified"
)

for dataset in ["Training Set", "Test Set"]:
    mask = df["Dataset"] == dataset
    y_offset = 0.02 if dataset == "Training Set" else -0.02

    for _, row in df[mask].iterrows():
        fig.add_annotation(
            x=row["Steps Away"],
            y=row["Accuracy"],
            text=f"{row['Accuracy']:.2f}",
            showarrow=False,
            yshift=10 if dataset == "Training Set" else -10,
            font={"size": 8, "family": "OpenSansRegular", "color": "#f0e6d2"}
        )

fig.show()
In [50]:
fig.write_html('charts/initial-accuracy-within-n.html', include_plotlyjs='cdn')

As we can see, the model's accuracy significantly increases when you give it some leniency:

  • With a leniency of 1 patch step, the model has a training accuracy of ~93% and a testing accuracy of ~88%
  • With a leniency of 2 patch steps, the model has a training accuracy of ~95% and a testing accuracy of ~92%

This is extraordinarily good, considering that there are 220 unique patch versions (i.e. the task is very granular). We can inch out a bit more performance on this model in the next step.
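For a rough sense of scale (this baseline is my own addition, not part of the original evaluation): a uniform random guesser over the 220 patch classes would achieve well under 1% exact-match accuracy:

```python
# Chance-level exact-match accuracy for a uniform guess over 220 patches
n_patches = 220
print(f"Uniform random baseline: {1 / n_patches:.4%}")  # → 0.4545%
```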

Step 7: Final Model¶

As of now, our model only considers the presence of a champion (whether they were picked or banned); it does not consider the order in which champions are picked or banned, which is highly indicative of the meta. Although I have made multiple attempts to incorporate the full draft-phase information into the model, it introduces so much noise that the signal is drowned out and accuracy suffers. So, even though the entire draft phase is at our disposal, we will only create two new features: whether the champion was the first pick by either team, and whether the champion was the first ban. This is a change to our current transform_picks_bans function, yielding transform_picks_bans_enhanced, which we will use in our final model:

In [51]:
def transform_picks_bans_enhanced(X):
    all_picks_bans = []
    for column in ["blue_picks", "red_picks", "blue_bans", "red_bans"]:
        picks_bans = X[column]
        all_picks_bans.extend([champ for picks in picks_bans for champ in picks])
    unique_champions = sorted(set(all_picks_bans))

    n_samples = len(X)
    n_champions = len(unique_champions)

    presence_features = np.zeros((n_samples, n_champions))
    first_ban_features = np.zeros((n_samples, n_champions))
    first_pick_features = np.zeros((n_samples, n_champions))

    champion_to_idx = {champ: idx for idx, champ in enumerate(unique_champions)}

    for i, (_, row) in enumerate(X.iterrows()):  # iterrows yields (index, row); the index label is unused
        all_champs = (
            row["blue_picks"] + row["red_picks"] + row["blue_bans"] + row["red_bans"]
        )
        for champ in all_champs:
            if champ in champion_to_idx:
                idx = champion_to_idx[champ]
                presence_features[i, idx] = 1

        first_blue_ban = row["blue_bans"][0] if len(row["blue_bans"]) > 0 else None
        first_red_ban = row["red_bans"][0] if len(row["red_bans"]) > 0 else None

        for first_ban in [first_blue_ban, first_red_ban]:
            if first_ban in champion_to_idx:
                idx = champion_to_idx[first_ban]
                first_ban_features[i, idx] = 1

        first_blue_pick = row["blue_picks"][0] if len(row["blue_picks"]) > 0 else None
        first_red_pick = row["red_picks"][0] if len(row["red_picks"]) > 0 else None

        for first_pick in [first_blue_pick, first_red_pick]:
            if first_pick in champion_to_idx:
                idx = champion_to_idx[first_pick]
                first_pick_features[i, idx] = 1

    presence_df = pd.DataFrame(
        presence_features, columns=[f"{champ}_presence" for champ in unique_champions]
    )
    first_ban_df = pd.DataFrame(
        first_ban_features, columns=[f"{champ}_first_ban" for champ in unique_champions]
    )
    first_pick_df = pd.DataFrame(
        first_pick_features,
        columns=[f"{champ}_first_pick" for champ in unique_champions],
    )

    return pd.concat([presence_df, first_ban_df, first_pick_df], axis=1)
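As an aside, the presence portion of this transform is a multi-hot encoding, which could equivalently be built with scikit-learn's MultiLabelBinarizer. A minimal sketch on a single toy draft (champion names here are illustrative, not from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

toy = pd.DataFrame({
    "blue_picks": [["Ahri", "Jinx"]],
    "red_picks": [["Zed"]],
    "blue_bans": [["Yasuo"]],
    "red_bans": [["Ahri"]],  # duplicate champions collapse to a single 1
})

# Concatenate each row's picks and bans into one champion list
combined = toy.apply(
    lambda r: r["blue_picks"] + r["red_picks"] + r["blue_bans"] + r["red_bans"],
    axis=1,
)
mlb = MultiLabelBinarizer()
presence = pd.DataFrame(
    mlb.fit_transform(combined),
    columns=[f"{champ}_presence" for champ in mlb.classes_],
)
print(presence.columns.tolist())
# ['Ahri_presence', 'Jinx_presence', 'Yasuo_presence', 'Zed_presence']
```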
In [52]:
final_model = make_pipeline(
    FunctionTransformer(transform_picks_bans_enhanced),
    RandomForestClassifier(
        n_estimators=500,
        max_depth=16,
        max_features="sqrt",
        min_samples_split=16,
        min_samples_leaf=4,
        class_weight="balanced",
        random_state=42,
        n_jobs=6,
        verbose=1,
    ),
)

Creating our pipeline and training the model on the same train-test split as the previous model:

In [54]:
final_model.fit(X_train, y_train_encoded)

final_train_pred = final_model.predict(X_train)
final_test_pred = final_model.predict(X_test)

print(f"Training accuracy: {accuracy_score(y_train_encoded, final_train_pred):.4f}")
print(f"Testing accuracy: {accuracy_score(y_test_encoded, final_test_pred):.4f}")
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:    1.3s
[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed:    6.5s
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed:   15.1s
[Parallel(n_jobs=6)]: Done 500 out of 500 | elapsed:   17.2s finished
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:    1.5s
[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed:    6.1s
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed:   13.7s
[Parallel(n_jobs=6)]: Done 500 out of 500 | elapsed:   15.5s finished
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:    0.2s
[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed:    1.3s
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed:    3.3s
Training accuracy: 0.8274
Testing accuracy: 0.6876
[Parallel(n_jobs=6)]: Done 500 out of 500 | elapsed:    3.8s finished
In [55]:
print("\nTraining Metrics:")
final_train_metrics = evaluate_predictions(y_train_encoded, final_train_pred, le)

print("\nTest Metrics:")
final_test_metrics = evaluate_predictions(y_test_encoded, final_test_pred, le)
Training Metrics:
Accuracy (exact match): 0.8274
Major Patch Accuracy: 0.9821
Mean Absolute Error (in patch steps): 0.9559

Test Metrics:
Accuracy (exact match): 0.6876
Major Patch Accuracy: 0.9697
Mean Absolute Error (in patch steps): 1.6002

Performing a comparison between the two models:

In [56]:
pd.concat(
    {
        "Old Model": pd.DataFrame(
            {
                "Train": [
                    f"{initial_train_metrics['accuracy']:.4f}",
                    f"{initial_train_metrics['major_accuracy']:.4f}", 
                    f"{initial_train_metrics['mae']:.4f}",
                ],
                "Test": [
                    f"{initial_test_metrics['accuracy']:.4f}",
                    f"{initial_test_metrics['major_accuracy']:.4f}",
                    f"{initial_test_metrics['mae']:.4f}",
                ],
            },
            index=["Accuracy", "Major Patch Accuracy", "MAE"],
        ),
        "New Model": pd.DataFrame(
            {
                "Train": [
                    f"{final_train_metrics['accuracy']:.4f}",
                    f"{final_train_metrics['major_accuracy']:.4f}",
                    f"{final_train_metrics['mae']:.4f}",
                ],
                "Test": [
                    f"{final_test_metrics['accuracy']:.4f}", 
                    f"{final_test_metrics['major_accuracy']:.4f}",
                    f"{final_test_metrics['mae']:.4f}",
                ],
            },
            index=["Accuracy", "Major Patch Accuracy", "MAE"],
        ),
        "Improvement": pd.DataFrame(
            {
                "Train": [
                    f"{(final_train_metrics['accuracy'] - initial_train_metrics['accuracy']):.4f}",
                    f"{(final_train_metrics['major_accuracy'] - initial_train_metrics['major_accuracy']):.4f}",
                    f"{(initial_train_metrics['mae'] - final_train_metrics['mae']):.4f}",
                ],
                "Test": [
                    f"{(final_test_metrics['accuracy'] - initial_test_metrics['accuracy']):.4f}",
                    f"{(final_test_metrics['major_accuracy'] - initial_test_metrics['major_accuracy']):.4f}",
                    f"{(initial_test_metrics['mae'] - final_test_metrics['mae']):.4f}",
                ],
            },
            index=["Accuracy", "Major Patch Accuracy", "MAE"],
        ),
    },
    axis=1,
)
Out[56]:
Old Model New Model Improvement
Train Test Train Test Train Test
Accuracy 0.8243 0.6801 0.8274 0.6876 0.0031 0.0075
Major Patch Accuracy 0.9810 0.9659 0.9821 0.9697 0.0010 0.0038
MAE 1.0207 1.8295 0.9559 1.6002 0.0648 0.2293

As we can see, although the improvements in exact and major patch accuracy are extremely small, the new model has a noticeably better MAE: when we are wrong, we are off by about 0.23 fewer patch steps on average on the test set (and about 0.06 fewer on the training set). We can compare the $n$-step accuracy of the new model against the old model:

In [57]:
df = pd.DataFrame(
    {
        "Steps Away": range(len(final_train_metrics["within_n"])),
        "Training Set": final_train_metrics["within_n"],
        "Test Set": final_test_metrics["within_n"],
        "Training Set (Baseline)": initial_train_metrics["within_n"],
        "Test Set (Baseline)": initial_test_metrics["within_n"]
    }
).melt(id_vars=["Steps Away"], var_name="Dataset", value_name="Accuracy")

fig = px.line(
    df,
    x="Steps Away",
    y="Accuracy", 
    color="Dataset",
    markers=True,
    color_discrete_map={
        "Training Set (Baseline)": "#FF4E50",
        "Test Set (Baseline)": "#FF6B6E",
        "Training Set": "#C8AA6E", 
        "Test Set": "#785A28"
    },
    template="plotly_dark",
    width=650,
    height=600,
    category_orders={"Dataset": ["Training Set (Baseline)", "Test Set (Baseline)", "Training Set", "Test Set"]}
)

fig.update_layout(
    paper_bgcolor='#0A1428',
    plot_bgcolor='#0A1428',
    title={
        "text": "Model Accuracy Within N Patch Versions",
        "font": {"family": "Beaufort"},
        "y": 0.95
    },
    font=dict(
        family="OpenSansRegular",
        color='#f0e6d2'
    ),
    xaxis={
        "title": {
            "text": "Number of Patch Versions (Steps) Away",
            "font": {"family": "OpenSansRegular"}
        },
        "tickfont": {"family": "OpenSansRegular"},
        "tickmode": "linear"
    },
    yaxis={
        "title": {
            "text": "Cumulative Accuracy",
            "font": {"family": "OpenSansRegular"}
        },
        "tickfont": {"family": "OpenSansRegular"}
    },
    showlegend=True,
    legend={"x": 0.75, "y": 0.15},
    hovermode="x unified"
)

fig.update_traces(
    line=dict(dash="dash"),
    selector=dict(name="Training Set (Baseline)")
)
fig.update_traces(
    line=dict(dash="dash"),
    selector=dict(name="Test Set (Baseline)")
)

for dataset in ["Training Set", "Test Set"]:
    mask = df["Dataset"] == dataset
    y_offset = 0.02 if dataset == "Training Set" else -0.02

    for _, row in df[mask].iterrows():
        fig.add_annotation(
            x=row["Steps Away"],
            y=row["Accuracy"],
            text=f"{row['Accuracy']:.2f}",
            showarrow=False,
            yshift=10 if dataset == "Training Set" else -10,
            font={"size": 8, "family": "OpenSansRegular", "color": "#f0e6d2"}
        )

fig.show()
In [58]:
fig.write_html('charts/final-accuracy-within-n.html', include_plotlyjs='cdn')

This is the most I could squeeze out of this model, given the limitations of the available features (we literally only have the champion names and the draft order), the limitations of the model architecture itself (a simple random forest, since I haven't learned about deep learning yet), and how granular the task is. I am more than happy with this result.

Step 8: Fairness Analysis¶

For our fairness analysis, we will be determining if our model is fair to both "old" games and "new" games—that is, if we split the data down the middle by the median patch version, are we able to predict the patch version with the same accuracy for both sets?

To start, we must split the data down the middle. We create test_predictions to store the true and predicted patch versions, as well as the inverse transformed versions for readability:

In [59]:
test_predictions = pd.DataFrame(
    {
        "true_encoded": y_test_encoded,
        "pred_encoded": final_test_pred,
        "true": le.inverse_transform(y_test_encoded),
        "pred": le.inverse_transform(final_test_pred),
    }
)

test_predictions["error"] = np.abs(
    test_predictions["true_encoded"] - test_predictions["pred_encoded"]
)

display_df(test_predictions.sort_values(by="pred", ascending=True), rows=10)
true_encoded pred_encoded true pred error
16461 0 0 03.15 03.15 0
5555 0 0 03.15 03.15 0
7559 0 0 03.15 03.15 0
2064 0 0 03.15 03.15 0
13464 127 0 10.14 03.15 127
... ... ... ... ... ...
12785 218 218 14.21 14.21 0
8289 218 218 14.21 14.21 0
13593 201 218 14.02 14.21 17
264 219 218 14.22 14.21 1
4663 218 219 14.21 14.22 1

16560 rows × 5 columns

Let's find the "middle" of our test set by finding the median patch version, which is trivial given that we have incrementally encoded the patch versions:

In [60]:
print(
    f"Median patch version used for split: {le.inverse_transform([int(np.median(test_predictions['true_encoded']))])[0]}"
)
Median patch version used for split: 11.04

Given that our median patch version is 11.04, let's split the data into "early" and "late" games, label them accordingly, and calculate the accuracy for each period:

In [61]:
test_predictions["period"] = np.where(
    test_predictions["true_encoded"] < int(np.median(test_predictions["true_encoded"])),
    "Early",
    "Late",
)

early_acc = (
    test_predictions[test_predictions["period"] == "Early"]["true"]
    == test_predictions[test_predictions["period"] == "Early"]["pred"]
).mean()

late_acc = (
    test_predictions[test_predictions["period"] == "Late"]["true"]
    == test_predictions[test_predictions["period"] == "Late"]["pred"]
).mean()

acc_diff = early_acc - late_acc

print(f"Early Patches Accuracy: {early_acc:.4f}")
print(f"Late Patches Accuracy: {late_acc:.4f}")
print(f"Difference (Early - Late): {acc_diff:.4f}")

early_mae = test_predictions[test_predictions["period"] == "Early"]["error"].mean()
late_mae = test_predictions[test_predictions["period"] == "Late"]["error"].mean()
mae_diff = early_mae - late_mae

# approximate "same major version" as being within 10 encoded patch steps,
# a proxy since the number of minor patches per major version varies
early_major_acc = (
    test_predictions[test_predictions["period"] == "Early"]["error"] <= 10
).mean()
late_major_acc = (
    test_predictions[test_predictions["period"] == "Late"]["error"] <= 10
).mean()
major_acc_diff = early_major_acc - late_major_acc

print("\nMean Absolute Error:")
print(f"Early Patches MAE: {early_mae:.4f}")
print(f"Late Patches MAE: {late_mae:.4f}")
print(f"Difference (Early - Late): {mae_diff:.4f}")

print("\nMajor Patch Accuracy:")
print(f"Early Patches Major Accuracy: {early_major_acc:.4f}")
print(f"Late Patches Major Accuracy: {late_major_acc:.4f}")
print(f"Difference (Early - Late): {major_acc_diff:.4f}")
Early Patches Accuracy: 0.6965
Late Patches Accuracy: 0.6790
Difference (Early - Late): 0.0176

Mean Absolute Error:
Early Patches MAE: 1.1868
Late Patches MAE: 2.0002
Difference (Early - Late): -0.8135

Major Patch Accuracy:
Early Patches Major Accuracy: 0.9833
Late Patches Major Accuracy: 0.9680
Difference (Early - Late): 0.0153

The difference in exact-match accuracy will be our observed test statistic. To determine whether this difference is statistically significant, we perform a permutation test with the following null and alternative hypotheses:

  • Null hypothesis: The difference in accuracy between early and late patches is due to random chance, and any observed difference is not systematic.
  • Alternative hypothesis: There is a systematic difference in model accuracy between early and late patches that cannot be explained by random chance alone.

We use the difference in means as our test statistic since we're comparing the average accuracy between two independent groups. This is appropriate because accuracy is a proportion (between 0 and 1) and we want to measure if one group systematically performs better than the other. The difference in means directly captures this gap in performance:
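The diff_in_means statistic passed below comes from the course's dsc80_utils helpers; a minimal stand-in (assuming it simply computes the difference of group means, which is the shape scipy's permutation_test expects for an independent-samples statistic):

```python
import numpy as np

def diff_in_means(x, y):
    # Difference of group means; with 0/1 correctness indicators this is
    # exactly the difference in accuracy between the two groups.
    return np.mean(x) - np.mean(y)

early = np.array([1, 1, 0, 1])  # toy correctness indicators
late = np.array([0, 1, 0, 0])
print(diff_in_means(early, late))  # 0.5
```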

In [62]:
res = permutation_test(
    data=(
        (
            test_predictions[test_predictions["period"] == "Early"]["true"]
            == test_predictions[test_predictions["period"] == "Early"]["pred"]
        ).values,
        (
            test_predictions[test_predictions["period"] == "Late"]["true"]
            == test_predictions[test_predictions["period"] == "Late"]["pred"]
        ).values,
    ),
    statistic=diff_in_means,
    permutation_type="independent",
    vectorized=False,
    n_resamples=1000,
    alternative="two-sided",
    random_state=42,
)

print(f"Observed difference in accuracy: {res.statistic:.4f}")
print(f"P-value: {res.pvalue:.4f}")

fig = plot_tvd_distribution(res, title="Null Distribution of Difference in Means: Early vs. Late Patches", x_label="Difference in Means", y_label="Frequency")
fig.show()
Observed difference in accuracy: 0.0176
P-value: 0.0340
In [63]:
fig.write_html('charts/fairness-analysis.html', include_plotlyjs='cdn')

We have an observed difference in means $= 0.0176$ and $p = 0.0340$. At a significance level of 0.05, we can reject the null hypothesis—this suggests that the model is not equally accurate on "old" and "new" games. This is an intuitive result: the game accrues more and more complexity as champions and items are added or reworked, so the latter half of the dataset is noisier, and that is where the model is more likely to be wrong.